{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas处理分析网站原始访问日志\n",
    "\n",
    "目标：真实项目的实战，探索Pandas的数据处理与分析\n",
    "\n",
    "实例：  \n",
    "数据来源：我自己的wordpress博客http://www.crazyant.net/ 的访问日志    \n",
    "\n",
    "实现步骤：  \n",
    "1、读取数据、清理、格式化  \n",
    "2、统计爬虫spider的访问比例，输出柱状图  \n",
    "3、统计http状态码的访问占比，输出饼图  \n",
    "4、统计按小时、按天的PV/UV流量趋势，输出折线图  \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1、读取数据并清理格式化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "\n",
    "from pyecharts import options as opts\n",
    "from pyecharts.charts import Bar,Pie,Line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取整个目录，将所有的文件合并到一个dataframe\n",
    "data_dir = \"./datas/crazyant/blog_access_log\"\n",
    "\n",
    "df_list = []\n",
    "\n",
    "import os\n",
    "for fname in os.listdir(f\"{data_dir}\"):\n",
    "    df_list.append(pd.read_csv(f\"{data_dir}/{fname}\", sep=\" \", header=None, error_bad_lines=False))\n",
    "\n",
    "df = pd.concat(df_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[[0, 3, 6, 9]].copy()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns = [\"ip\", \"stime\", \"status\", \"client\"]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2、统计spider的比例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"is_spider\"] = df[\"client\"].str.lower().str.contains(\"spider\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_spider = df[\"is_spider\"].value_counts()\n",
    "df_spider"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bar = (\n",
    "        Bar()\n",
    "        .add_xaxis([str(x) for x in df_spider.index])\n",
    "        .add_yaxis(\"是否Spider\", df_spider.values.tolist())\n",
    "        .set_global_opts(title_opts=opts.TitleOpts(title=\"爬虫访问量占比\"))\n",
    ")\n",
    "bar.render_notebook()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3、访问状态码的数量对比"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_status = df.groupby(\"status\").size()\n",
    "df_status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(zip(df_status.index, df_status))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pie = (\n",
    "        Pie()\n",
    "        .add(\"状态码比例\", list(zip(df_status.index, df_status)))\n",
    "        .set_series_opts(label_opts=opts.LabelOpts(formatter=\"{b}: {c}\"))\n",
    "    )\n",
    "pie.render_notebook()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4、实现按小时、按天粒度的流量统计"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"stime\"] = pd.to_datetime(df[\"stime\"].str[1:], format=\"%d/%b/%Y:%H:%M:%S\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.set_index(\"stime\", inplace=True)\n",
    "df.sort_index(inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 按小时统计\n",
    "#df_pvuv = df.resample(\"H\")[\"ip\"].agg(pv=np.size, uv=pd.Series.nunique)\n",
    "\n",
    "# 按每6个小时统计\n",
    "#df_pvuv = df.resample(\"6H\")[\"ip\"].agg(pv=np.size, uv=pd.Series.nunique)\n",
    "\n",
    "# 按天统计\n",
    "df_pvuv = df.resample(\"D\")[\"ip\"].agg(pv=np.size, uv=pd.Series.nunique)\n",
    "\n",
    "df_pvuv.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "line = (\n",
    "        Line()\n",
    "        .add_xaxis(df_pvuv.index.to_list())\n",
    "        .add_yaxis(\"PV\", df_pvuv[\"pv\"].to_list())\n",
    "        .add_yaxis(\"UV\", df_pvuv[\"uv\"].to_list())\n",
    "        .set_global_opts(\n",
    "            title_opts=opts.TitleOpts(title=\"PVUV数据对比\"),\n",
    "            tooltip_opts=opts.TooltipOpts(trigger=\"axis\", axis_pointer_type=\"cross\")\n",
    "        )\n",
    "    )\n",
    "line.render_notebook()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
